Abstract. We investigate the problem of the maximum number of cubic
subwords (of the form www) in a given word. We also consider square
subwords (of the form ww). The problem of the maximum number of
squares in a word is not well understood. Several new results related to
this problem are produced in the paper. We consider two simple problems
related to the maximum number of subwords which are squares or which
are highly repetitive; then we provide a nontrivial estimation for the
number of cubes. We show that the maximum number of squares xx
such that x is not a primitive word (nonprimitive squares) in a word of
length n is exactly $\lfloor n/2 \rfloor - 1$, and the maximum number of subwords of the
form $x^k$, for $k \ge 3$, is exactly n - 2. In particular, the maximum number
of cubes in a word is not greater than n - 2 either. Using very technical
properties of occurrences of cubes, we improve this bound significantly.
We show that the maximum number of cubes in a word of length n is
between $\frac{1}{2}n - o(n)$ and $\frac{4}{5}n$.



1 Introduction 

A repetition is a word composed (as a concatenation) of several copies of
another word. The exponent is the number of copies. We are interested in natural
exponents higher than 2. In [4] the authors also considered exponents which are
not integers.

In this paper we investigate the bounds for the maximum number of highly
repetitive subwords in a word of length n. A word is highly repetitive iff it is
of the form $x^k$ for some integer k greater than 2. In particular, cubes $w^3$ and
squares with nonprimitive x are highly repetitive.

The subject of computing the maximum number of squares and repetitions in
words is one of the fundamental topics in combinatorics on words [16, 20], initiated
by A. Thue [27]; it is also important in other areas: lossless compression,
word representation, computational biology etc.

The behaviour of the function squares(n), the maximum number of squares in
a word of length n, is not well understood, though the subject of squares was
studied by many authors, see [7, 8, 15, 23]. The best known results related to the
value of squares(n) are, see [11, 13, 14]:

    $n - o(n) \le \mathrm{squares}(n) \le 2n - O(\log n)$.

In this paper we concentrate on larger powers of words and show that in this
case we can have much better estimations. Let cubes(n) denote the maximum
number of cubes in a word of length n. We show that:

    $\frac{1}{2}n - o(n) \le \mathrm{cubes}(n) \le \frac{4}{5}n$.

There are known efficient algorithms for the computation of integer powers
in words, see [1, 3, 9, 21, 22].

The powers in words are related to maximal repetitions, also called runs. It
is surprising that the bounds for the number of runs are much tighter than for
squares; this is due to the work of many people [2, 5, 6, 12, 17, 18, 24-26].

Our main result is a new estimation of the number of cubic subwords. We
use a new interesting technique in the analysis: the proof of the upper bound is
reduced to the proof of an invariant of some abstract algorithm (in our invariant
lemma). There is still some gap between the upper and lower bounds, but it is much
smaller than the corresponding gap for the number of squares.

Fig. 1. Example of a word with 11 distinct cubes. This is a word of length 30 with
maximal number of cubes among binary words of the same length.

2 Periodicities in strings

We consider words over a finite alphabet $\Sigma$, $u \in \Sigma^*$; by $\varepsilon$ we denote the empty
word. The positions in a word u are numbered from 1 to |u|. For $u = u_1 \ldots u_k$, by
$u[i..j]$ we denote the subword of u equal to $u_i \ldots u_j$; in particular, $u[i] = u[i..i]$.

We say that a positive integer p is a period of a word $u = u_1 \ldots u_k$ if $u_i = u_{i+p}$
holds for $1 \le i \le k - p$. If $w^k = u$ (k is a nonnegative integer) then we say that
u is the $k$-th power of the word w.

The primitive root of a word u, denoted root(u), is the shortest word w such
that $w^k = u$ for some positive k. We call a word u primitive if root(u) = u;
otherwise it is called nonprimitive. It can be proved that the primitive root of a
word u is the only primitive word w such that $w^k = u$ for some positive k.

A square is the 2nd power of some word, and an np-square (a nonprimitive
square) is a square of a word that is not primitive. A cube is the 3rd power of some
word.
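
The definitions of this section translate directly into a few lines of Python. The following is only an illustrative sketch: the function names (`periods`, `primitive_root`, `is_primitive`, `is_np_square`) are our own and do not come from the paper, and words are represented as ordinary strings with 0-based indexing.

```python
def periods(u: str) -> list[int]:
    """All p with 1 <= p <= |u| such that u[i] == u[i+p] wherever both exist."""
    n = len(u)
    return [p for p in range(1, n + 1)
            if all(u[i] == u[i + p] for i in range(n - p))]


def primitive_root(u: str) -> str:
    """Shortest w such that w^k == u for some positive k (u must be nonempty)."""
    n = len(u)
    for d in range(1, n + 1):
        if n % d == 0 and u[:d] * (n // d) == u:
            return u[:d]
    raise ValueError("u must be nonempty")


def is_primitive(u: str) -> bool:
    return primitive_root(u) == u


def is_np_square(u: str) -> bool:
    """True if u = xx for a nonempty, nonprimitive x."""
    half = len(u) // 2
    return (len(u) % 2 == 0 and half > 0
            and u[:half] == u[half:] and not is_primitive(u[:half]))
```

For example, `primitive_root("abababab")` is `"ab"`, and `"aaaa"` is an np-square because its half `"aa"` is not primitive.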

In this paper we focus on the last occurrences of subwords. Hence, whenever
we say that a word u occurs at position i of a word v, we mean its last occurrence,
that is, $v[i..i + |u| - 1] = u$ and $v[j..j + |u| - 1] \ne u$ for $j > i$. The following
lemma is used extensively throughout the article.

Lemma 1 (Periodicity lemma [10, 20]). If a word of length n has two periods
p and q, such that $p + q \le n + \gcd(p, q)$, then $\gcd(p, q)$ is also a period of the
word.

In this paper we often use the so-called weak version of this lemma, where we only
assume that $p + q \le n$.
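
The periodicity lemma can be sanity-checked by brute force on short words. The sketch below is not a proof and is not part of the paper; `has_period` and `check_periodicity_lemma` are hypothetical helper names, and the search is limited to a small alphabet and small lengths.

```python
from itertools import product
from math import gcd


def has_period(u: str, p: int) -> bool:
    """True if u[i] == u[i+p] for every i where both positions exist."""
    return all(u[i] == u[i + p] for i in range(len(u) - p))


def check_periodicity_lemma(max_len: int = 10, alphabet: str = "ab"):
    """Return a counterexample (word, p, q), or None if none is found."""
    for n in range(1, max_len + 1):
        for letters in product(alphabet, repeat=n):
            u = "".join(letters)
            ps = [p for p in range(1, n + 1) if has_period(u, p)]
            for p in ps:
                for q in ps:
                    g = gcd(p, q)
                    # Hypothesis of Lemma 1: p + q <= n + gcd(p, q).
                    if p + q <= n + g and not has_period(u, g):
                        return (u, p, q)
    return None
```

On binary words of length up to 10 the search is expected to return `None`, in agreement with the lemma.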

3 Basic properties of highly repetitive subwords 

A word is said to be highly repetitive (an hr-word) if it is the $k$-th power of a nonempty
word, for $k \ge 3$.




Fig. 2. The situation when one hr-word is a (long) prefix of another hr-word implies
that root(x) = root(y); consequently, x is a suffix of y.

Lemma 2. If an hr-word x is a prefix of an hr-word y and $|x| > |y| - |\mathrm{root}(y)|$,
then x is also a suffix of y.

Proof. Due to the periodicity lemma, both words have the same smallest period,
and it is a common divisor of the lengths of their primitive roots, see Figure 2.
Consequently, we have root(x) = root(y) and x is a suffix of y. □

Lemma 3. Assume that x and y are two hr-words, where $y = z^3$ and x is a
subword of y starting at position i and ending at position j such that

    $i \le \lceil |\mathrm{root}(z)|/2 \rceil + 1$ and $j > |z^2|$.

Then $|\mathrm{root}(x)| = |\mathrm{root}(y)|$.



Fig. 3. The situation from Lemma 3.



Proof. Let $x = w^k$, for some $k \ge 3$. Using the inequalities on i and j from the
lemma, we obtain:

    $|x| = j - i + 1 \ge |z^2| + 1 - \lceil |\mathrm{root}(z)|/2 \rceil - 1 + 1 \ge 2 \cdot |z| - \frac{|z| + 1}{2} + 1 > \frac{3}{2} \cdot |z|$.

Let us also observe that $|\mathrm{root}(x)|$ and $|\mathrm{root}(y)|$ are both periods of x. Moreover:

    $|x| = |w^k| = |w| + \frac{k-1}{k} \cdot |x| \ge |w| + \frac{2}{3} \cdot |x| > |w| + |z| \ge |\mathrm{root}(x)| + |\mathrm{root}(y)|$.

From this, by the periodicity lemma, we obtain that $\gcd(|\mathrm{root}(x)|, |\mathrm{root}(y)|)$ is
also a period of x. However, root(x) and root(y) are subwords of x, so $|\mathrm{root}(x)| = |\mathrm{root}(y)|$,
since in the opposite case one of the words root(x), root(y) would not
be primitive. □

4 Simple bounds for highly repetitive subwords 

In this section we give some simple estimations of the number of square subwords 
with nonprimitive roots and cubic subwords. 

Lemma 4. Let u be a word. Let us consider highly repetitive subwords of u of
the form $v^k$, for $k \ge 3$ and v primitive. For each such subword we consider its
(last) occurrence in u. For each position i in u, at most one such subword can
have its (last) occurrence at position i.

Proof. Let us assume that we have two different hr-words x and y with their last
occurrences starting at position i, and let us assume that x is shorter. Then we
have $|x| > |y| - |\mathrm{root}(y)|$, otherwise the considered occurrence of x would not be
the last one.

Now we can apply Lemma 2: x is not only a prefix of y, but also its suffix.
Hence, x appears later in the text and the last occurrence of x in u does not
start at position i. This contradiction proves that the assumption that the last
occurrences of x and y start at position i is false. □
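
Lemma 4 can be checked experimentally on short words. The sketch below is an illustration under our own naming (`is_hr`, `last_occurrence_starts`, `lemma4_holds`, `check_lemma4` are hypothetical helpers); it uses 0-based positions and Python's `str.rfind` to locate last occurrences.

```python
from itertools import product


def is_hr(w: str) -> bool:
    """True if w = v^k for some nonempty v and some k >= 3."""
    n = len(w)
    return any(n % d == 0 and w[:d] * (n // d) == w
               for d in range(1, n // 3 + 1))


def last_occurrence_starts(u: str) -> dict[str, int]:
    """Map every distinct hr-subword of u to the start of its last occurrence."""
    hr = {u[i:j] for i in range(len(u))
          for j in range(i + 3, len(u) + 1) if is_hr(u[i:j])}
    return {w: u.rfind(w) for w in hr}


def lemma4_holds(u: str) -> bool:
    """No two distinct hr-subwords share the start of their last occurrence."""
    starts = list(last_occurrence_starts(u).values())
    return len(starts) == len(set(starts))


def check_lemma4(max_len: int = 9, alphabet: str = "ab") -> bool:
    return all(lemma4_holds("".join(w))
               for n in range(1, max_len + 1)
               for w in product(alphabet, repeat=n))
```

For instance, in `"aaaaaa"` the hr-subwords `aaa`, `aaaa`, `aaaaa`, `aaaaaa` have their last occurrences at starts 3, 2, 1, 0, all distinct.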



The following fact is a consequence of Lemma 4. 



Theorem 1. The maximum number of highly repetitive subwords of a word of
length n > 2 is exactly n - 2.

Proof. From Lemma 4 we know that at each position there can be at most one
last occurrence of a nonempty hr-word. Moreover, the minimum possible length
of such a word is 3. Therefore, there can be no such occurrences at positions n
and n - 1. On the other hand, this upper bound is reached by the word $a^n$. □
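
As a small experiment (not part of the original argument), one can count distinct hr-subwords directly and compare with the n - 2 bound. The helper names `count_hr_subwords` and `max_hr_subwords` are assumptions of this sketch, and the exhaustive search is only feasible for short words.

```python
from itertools import product


def is_hr(w: str) -> bool:
    """True if w = v^k for some nonempty v and some k >= 3."""
    n = len(w)
    return any(n % d == 0 and w[:d] * (n // d) == w
               for d in range(1, n // 3 + 1))


def count_hr_subwords(u: str) -> int:
    """Number of distinct highly repetitive subwords of u."""
    return len({u[i:j] for i in range(len(u))
                for j in range(i + 3, len(u) + 1) if is_hr(u[i:j])})


def max_hr_subwords(n: int, alphabet: str = "ab") -> int:
    """Maximum of count_hr_subwords over all words of length n (brute force)."""
    return max(count_hr_subwords("".join(w))
               for w in product(alphabet, repeat=n))
```

For the word $a^n$ the hr-subwords are exactly $a^3, \ldots, a^n$, so `count_hr_subwords("a" * n)` equals n - 2, which matches the theorem.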

As a corollary, we obtain a simple upper bound for the number of cubes, 
since cubes are hr-words. 

Corollary 1. Let us consider a word u of length n. The number of nonempty
cubes appearing in u is not greater than n - 2.

We improve this upper bound substantially in the next sections. However, it 
requires a lot of technicalities. Another implication of Theorem 1 is a tight 
bound for the number of np-squares. 

Theorem 2. Let u be a word of length n. The maximum number of nonempty
np-squares appearing in u is exactly $\lfloor n/2 \rfloor - 1$.

Proof. Each nonempty np-square can be viewed as $v^{2i}$ for some nonempty
primitive v and $i \ge 2$. However, each such np-square contains a subword $v^{2i-1}$ which
is not an np-square (due to the periodicity lemma), but still an hr-word. Hence,
the number of nonempty subwords of the form $v^{2i-1}$ (for primitive v and $i \ge 2$)
appearing in the given word is not smaller than the number of nonempty
np-squares.

Observe that Theorem 1 limits the total number of both subwords of the
form $v^{2i}$ and $v^{2i-1}$ by n - 2.

Hence, the total number of nonempty np-squares appearing in the given word
is not greater than $\frac{n}{2} - 1$, and since it is an integer, it is not greater than $\lfloor n/2 \rfloor - 1$.
On the other hand, this upper bound is reached by the word $a^n$. □
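
The bound of Theorem 2 can also be observed numerically. The following sketch counts distinct np-squares by brute force; `is_primitive`, `count_np_squares` and `max_np_squares` are names introduced here for illustration only.

```python
from itertools import product


def is_primitive(w: str) -> bool:
    """True if w is not a proper power of a shorter word."""
    n = len(w)
    return not any(n % d == 0 and w[:d] * (n // d) == w for d in range(1, n))


def count_np_squares(u: str) -> int:
    """Number of distinct subwords xx of u with x nonempty and nonprimitive."""
    found = set()
    for i in range(len(u)):
        for half in range(1, (len(u) - i) // 2 + 1):
            x = u[i:i + half]
            if x == u[i + half:i + 2 * half] and not is_primitive(x):
                found.add(x + x)
    return len(found)


def max_np_squares(n: int, alphabet: str = "ab") -> int:
    """Maximum of count_np_squares over all words of length n (brute force)."""
    return max(count_np_squares("".join(w))
               for w in product(alphabet, repeat=n))
```

In $a^n$ the np-squares are $a^4, a^6, a^8, \ldots$, which gives $\lfloor n/2 \rfloor - 1$ of them.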



5 The structure of occurrences of cubic subwords 

In this section we introduce some combinatorial facts about words that are
necessary in the proof of the $\frac{4}{5}n$ upper bound on the number of cubes in a word of
length n.

Lemma 5. Let $v^3$ and $w^3$ be two nonempty cubes occurring in a word u at
positions i and j respectively, such that:

    $i < j \le i + \lceil |\mathrm{root}(v)|/2 \rceil$.

Then:

    $|\mathrm{root}(w)| = |\mathrm{root}(v)|$ or $|\mathrm{root}(w)| \ge 2 \cdot |\mathrm{root}(v)| - (j - i - 1)$.



Proof. Let us denote $p = |\mathrm{root}(v)|$, $q = |\mathrm{root}(w)|$, and let k be the position of
the last letter of $w^3$.

Case 1.
Let us first consider the case when the (last) occurrence of $w^3$ is totally inside
$v^3$. Observe that k must then be within the last of the three v's, since otherwise $w^3$
would occur in u at position j + p or further (see also Fig. 3). Hence, due to
Lemma 3, we obtain q = p.

Case 2.
In the opposite case, let x be the maximal prefix of $w^3$ that lies inside $v^3$. If
$p \ne q$ then p + q must be greater than |x|. Indeed, if $p + q \le |x|$ then both root(v)
and root(w) would be subwords of x, so if $p \ne q$, then one of them would not be
primitive due to the periodicity lemma. Therefore:

    $p + q > |x| \ge |v^3| - (j - i) \ge 3p - (j - i)$.

Consequently $q \ge 2p - (j - i) + 1$. □

Let us introduce a useful notion of p-occurrence. 

Definition 1. A p-occurrence is the (last) occurrence of a cube with primitive
root of length p.

It turns out that the primitive roots of cubes appearing close to each other 
cannot be arbitrary. It is formally expressed by the following lemma. 

Lemma 6. Let $a_1, a_2, \ldots, a_{p+1}$ be an increasing sequence of positions in a word
u, such that $a_{j+1} \le a_j + p$ for $j = 1, 2, \ldots, p$. It is not possible for all these
positions to contain p-occurrences.

Proof. Let us assume, to the contrary, that at each of the positions $a_1, a_2, \ldots, a_{p+1}$
there is a p-occurrence. Observe that the inequalities from the hypothesis of the
lemma imply that the primitive roots of cubes occurring at these positions are
all cyclic rotations of each other. There are only p different rotations of such
primitive roots; therefore, due to the pigeonhole principle, some two of them
must be equal.

It suffices to show that all these cubes have the same length, because then
some two of them are equal, and consequently one of them is not the last
occurrence of the cube.

Assume to the contrary that some of the considered cubes have different
lengths. Let $a_j$ and $a_{j+1}$ be two considered positions, such that cubes ($v^3$ and
$w^3$ respectively) occurring at these positions have different lengths (3kp and 3lp
respectively, for $k \ne l$). Let us consider two cases.

Case 1. If l < k, then $3kp - 3lp \ge 3p$, and $w^3$ occurs in u at position $a_{j+1} + p$
or further (see Fig. 4).

Fig. 4. The positions of cubes $v^3$ and $w^3$ in the case l < k: $a_{j+1}$ is not the last
occurrence of $w^3$.

Fig. 5. The positions of cubes $v^3$ and $w^3$ in the case k < l: $a_j$ is not the last occurrence
of $v^3$.

Case 2. If k < l, then $3lp - 3kp \ge 3p$ and $v^3$ appears in u at position $a_j + p$ or
further (see Fig. 5).

In both cases we obtain a contradiction. Hence, it is not possible that the
lengths of the cubes differ. □

Let us introduce a notion of independent prefixes. 

Definition 2. We say that v is the independent prefix of u if it is the shortest
prefix of u that is:

1. a single letter word, if there is no occurrence of a cube at the first position
of u, or otherwise

2. a prefix that ends with a q-occurrence (for some $q \ge 1$) followed by exactly
$\lceil q/2 \rceil$ positions without any occurrences (here all occurrences are considered
within u).

It is not obvious that the above definition is valid. Therefore, we prove the 
following lemma: 

Lemma 7. For every word u, there exists an independent prefix v of u. 

Proof. If there is no occurrence of a cube at the first position of u, then obviously
v = u[1].

In the opposite case, let us assume, to the contrary, that the independent
prefix does not exist. Let q be the maximum value such that there exists
a q-occurrence in u, and let i be the rightmost position in u that contains a
q-occurrence. From Lemma 5, the $\lceil q/2 \rceil$ positions following i do not contain any
occurrences of cubes. Thus, the prefix $u[1..i + \lceil q/2 \rceil]$ satisfies the definition of an
independent prefix, a contradiction. □

6 Algorithm Abstract-Simulation 

Let v be the independent prefix of a word u and let |v| > 1. Let $(c_i)_{i=1}^{|v|}$ be a
sequence describing the occurrences starting within v: $c_i = 0$ iff there are no
occurrences at position u[i], and $c_i = q$ iff there is a q-occurrence at position u[i].
We start with the following observations.

a) If $c_i$ and $c_j$ is a pair of consecutive nonzero elements of c (i.e. i < j, $c_i, c_j > 0$
and $c_{i+1} = \ldots = c_{j-1} = 0$) then $j - i \le \lceil c_i/2 \rceil$. Indeed, if $j - i > \lceil c_i/2 \rceil$, then
the prefix of u of length $i + \lceil c_i/2 \rceil$ or shorter would be an independent prefix
of u.

b) For $c_i$ and $c_j$ as in a), either $c_j = c_i$ or $c_j \ge 2c_i - (j - i - 1)$. This observation is due to
Lemma 5.

c) From Lemma 6 and due to a) we have that no q + 1 consecutive positive
elements of c are equal to q.
From now on, we abstract from the actual word u, and focus only on the 
properties of sequence c. We will analyze the ratio R of nonzero elements of c to 
the length of c. 

Let us observe that if c contains a pair of equal elements $c_i = c_j > 0$
such that all the elements between them are equal to zero, then all the elements between
$c_i$ and $c_j$ can be removed from c without decreasing R. Also, if c contains a
subsequence of consecutive elements equal to q (q > 0) of length less than q
then this subsequence can be extended to length q without decreasing R. Let c'
be the sequence obtained from c by performing the described modification steps
(as many times as possible). Observe that none of these steps violates properties
a)-c).

Every possible sequence c' can be generated by the (nondeterministic)
pseudocode shown below. The following variables are used in the pseudocode:

- p: the value of the last positive element of c'

- len: the length of the sequence c' without the $\lceil p/2 \rceil$ trailing zeros

- occ: the number of positive elements in c'

- l: the gap between consecutive different positive elements of c'

- a: the difference between the actual value of a positive element of c' and
the lower bound from Lemma 5.

Each step of the repeat loop corresponds to extending sequence c', i.e. adding
l zeros and p elements of value p.



3 3 3  0  5 ... 5 (5 times)  0 0  20 ... 20 (20 times)  0 ... 0 (6 times)  34 ... 34 (34 times)  0 ... 0 (17 times)

Fig. 6. An example of sequence c'. The length of the sequence is 88 and it contains 62
positive elements. The ratio is $62/88 \approx 0.70 < 4/5$.



Note that the algorithm specified by the pseudocode is nondeterministic in
several different aspects: the initial value of p, the number of steps of the
repeat loop and the values of l and a.



Algorithm Abstract-Simulation

p := some positive integer;
occ := p; len := p;
output: p ... p (p times)

repeat an arbitrary number of times
    Invariant I(p, occ, len): $\frac{\mathit{occ}}{\mathit{len} + p/2} \le \frac{4}{5}$.
    l := some integer from interval $[0, \lceil p/2 \rceil)$;
    a := some nonnegative integer;
    p := 2p - l + a;
    occ := occ + p;
    len := len + l + p;
    output: 0 ... 0 (l times) p ... p (p times)
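
The pseudocode can be turned into a small deterministic simulation in which the nondeterministic choices are passed in explicitly. This is only a sketch under our own conventions: the function name `simulate`, the representation of choices as `(l, a)` pairs, and the use of `fractions.Fraction` for exact comparison are assumptions, not part of the paper.

```python
from fractions import Fraction


def simulate(p0: int, choices: list[tuple[int, int]]):
    """Run Abstract-Simulation from p = p0 with explicit (l, a) choices.

    Returns (c', occ, len), where c' includes the ceil(p/2) trailing zeros.
    The invariant occ / (len + p/2) <= 4/5 is asserted before every step
    and once more at the end.
    """
    p = p0
    occ = length = p
    seq = [p] * p

    def invariant() -> bool:
        return Fraction(occ) / (length + Fraction(p, 2)) <= Fraction(4, 5)

    assert invariant()
    for l, a in choices:
        if not (0 <= l < -(-p // 2)) or a < 0:   # -(-p // 2) == ceil(p / 2)
            raise ValueError("need 0 <= l < ceil(p/2) and a >= 0")
        p = 2 * p - l + a
        occ += p
        length += l + p
        seq += [0] * l + [p] * p
        assert invariant()
    return seq + [0] * (-(-p // 2)), occ, length
```

With `simulate(3, [(1, 0), (2, 12), (6, 0)])` the values of p are 3, 5, 20, 34, which gives 62 positive elements and a sequence of total length 88, the same figures as in the caption of Fig. 6.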



7 Upper bound on the number of cubic subwords 

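Lemma 8 (Invariant lemma). The following condition I(p, occ, len):

    $\frac{\mathit{occ}}{\mathit{len} + \frac{p}{2}} \le \frac{4}{5}$

is an invariant of the Abstract-Simulation Algorithm.

Proof. Before the first execution of the repeat loop, occ = len = p, and
consequently I(p, occ, len) holds:

    $\frac{\mathit{occ}}{\mathit{len} + \frac{p}{2}} = \frac{p}{\frac{3}{2}p} = \frac{2}{3} \le \frac{4}{5}$.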

Therefore, we only need to prove that if I(p, occ, len) holds then I(p', occ', len')
also holds, where p', occ' and len' are the values obtained as a result of a single
step of the repeat loop, i.e.:

    p' = 2p - l + a,
    occ' = occ + 2p - l + a,
    len' = len + 2p + a.

Let us restate I(p', occ', len') equivalently in the following way:

    $5 \cdot \mathit{occ} + 10p - 5l + 5a \le 4 \cdot \mathit{len} + 8p + 4a + 4 \cdot \frac{2p - l + a}{2}$.   (1)

On the other hand, I(p, occ, len) can be expressed as

    $5 \cdot \mathit{occ} \le 4 \cdot \mathit{len} + 4 \cdot \frac{p}{2}$.

Hence, in order to show (1), it is sufficient to prove that:

    $10p - 5l + 5a \le 8p + 4a + 2 \cdot (2p - l + a) - 2p$.   (2)

As a result of some rearrangement, (2) can be expressed as

    $0 \le 3l + a$

and this inequality trivially holds. □

We can now show the upper bound for the number of cubes in independent 
prefixes. 

Lemma 9. Let v be the independent prefix of u. The number of different nonempty
cubes that occur in u and start within v is not greater than $\frac{4}{5} \cdot |v|$.

Proof. Observe that if v satisfies the first condition of Definition 2, then the
conclusion trivially holds. Therefore, from now on we assume that |v| > 1.

As described in the previous section, instead of computing the ratio of cubes
that occur in u and start within v, we can deal with the ratio R of nonzero
elements of the corresponding sequence c to the length of c and show that $R \le \frac{4}{5}$.
For this it suffices to prove that for any valid sequence c' the ratio of nonzero
elements does not exceed $\frac{4}{5}$.

The Abstract-Simulation Algorithm generates every possible sequence c'.
Hence, in order to prove the $\frac{4}{5}$ bound, we need to show that the inequality

    $\frac{\mathit{occ}}{\mathit{len} + \lceil p/2 \rceil} \le \frac{4}{5}$

holds for every possible execution of the Algorithm. But this inequality is a
consequence of the fact that I(p, occ, len) is an invariant of the Algorithm (Lemma
8). □



Theorem 3. The number of different nonempty cubes that occur in a word of
length n is not greater than $\frac{4}{5}n$.

Proof. We prove the theorem by induction on n. The basis (n = 0) is trivial.
Now assume that the conclusion holds for all words of length not exceeding n and
consider a word u of length n + 1. Due to Lemma 7, there exists the independent
prefix v of u, $v \ne \varepsilon$, u = vw. The cubes occurring within u can be divided into
two groups: the ones that start within v and the ones that occur totally inside
w. By Lemma 9, the number of cubes in the first group does not exceed $\frac{4}{5} \cdot |v|$,
and by the inductive hypothesis, $\mathrm{cubes}(w) \le \frac{4}{5} \cdot |w|$. In total, there are at most

    $\frac{4}{5} \cdot |v| + \frac{4}{5} \cdot |w| = \frac{4}{5} \cdot |u|$

cubes within u, which ends the inductive proof. □
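
The upper bound can be compared with exhaustive search on short binary words. The sketch below is an experiment, not part of the proof; `count_cubes` and `max_cubes_binary` are names chosen here, and the search is exponential in n, so it is only meant for small lengths.

```python
from itertools import product


def count_cubes(u: str) -> int:
    """Number of distinct nonempty cubes www occurring in u."""
    n = len(u)
    cubes = set()
    for i in range(n):
        for m in range(1, (n - i) // 3 + 1):
            if u[i:i + m] == u[i + m:i + 2 * m] == u[i + 2 * m:i + 3 * m]:
                cubes.add(u[i:i + 3 * m])
    return len(cubes)


def max_cubes_binary(n: int) -> int:
    """Maximum number of distinct cubes over all binary words of length n."""
    return max(count_cubes("".join(w)) for w in product("01", repeat=n))
```

For every small n one can then check that `5 * max_cubes_binary(n) <= 4 * n`, as Theorem 3 requires.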

8 Lower bound on the number of cubic subwords 

A trivial lower bound on the number of different cubic subwords is given by the word $a^n$
with $\lfloor n/3 \rfloor$ cubic occurrences. The table presented in Figure 7 contains examples
of some words with a higher number of cubic subwords. These words have been
computed using extensive computer experiments.



  n   #cubes   ratio   word
 20      7     0.35    01110101011011011000
 30     11     0.36    000000110110110101101011010101
 40     16     0.40    1101101101110111011100010001000100100100
 50     20     0.40    11111111110010010010100101001010100101010010101000
 60     25     0.41    101001010010100101010010100101010010100101010010101001010100
 70     30     0.42    0000001101101101011010110101011010110101011010110101011010101101010111
 80     34     0.42    11011011010110110101101101011010110101011010110101011010110101011010101101010111
 90     40     0.44    111011011011101101101110110110111011011101101101110110111011011011101101110110111011101110
100     44     0.44    1000101010010101001010100101001010100101001010100101001010010101001010010100101010010100101001010111

Fig. 7. Examples of words with high number of distinct cubic subwords.



Let us proceed to the construction of the $\frac{1}{2}n$ lower bound. For $i \ge 1$, let $p_i$ be
the word $0^i 1 0^{i+1} 1$. Let $q_n$ be the concatenation $p_1 p_2 \cdots p_n$. Thus, for instance,
$q_4$ = 01001001000100010000100001000001.

Lemma 10. The length of $q_n$ is $n^2 + 4n$.

Proof. Clearly $p_i$ contains 2i + 3 bits, so

    $|q_n| = \sum_{i=1}^{n} (2i + 3) = n^2 + 4n$. □
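
The construction is easy to reproduce. In the sketch below, `p_word` and `q_word` are our own names for the words $p_i$ and $q_n$; the code builds them as Python strings so that the example $q_4$ and the length formula of Lemma 10 can be checked directly.

```python
def p_word(i: int) -> str:
    """The word p_i = 0^i 1 0^(i+1) 1."""
    return "0" * i + "1" + "0" * (i + 1) + "1"


def q_word(n: int) -> str:
    """The concatenation q_n = p_1 p_2 ... p_n."""
    return "".join(p_word(i) for i in range(1, n + 1))
```

For instance, `q_word(4)` has length 32, which agrees with $n^2 + 4n$ for n = 4.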



Lemma 11. The word $q_n$ contains exactly

    $\frac{n^2}{2} + \frac{n}{2} - 1 + \lfloor \frac{n+1}{3} \rfloor$

distinct cubes.

Proof. Note that the concatenation $p_i p_{i+1} = 0^i 1 0^{i+1} 1 0^{i+1} 1 0^{i+2} 1$ contains the
following i + 1 cubes:

    $(0^i 1 0)^3, (0^{i-1} 1 0^2)^3, \ldots, (0 1 0^i)^3, (1 0^{i+1})^3$.

Apart from that, in $q_n$ there are $\lfloor \frac{n+1}{3} \rfloor$ cubes of the form $0^3, 0^6, 0^9, \ldots$ Thus far
we obtained

    $\sum_{i=1}^{n-1} (i + 1) + \lfloor \frac{n+1}{3} \rfloor = \frac{n^2}{2} + \frac{n}{2} - 1 + \lfloor \frac{n+1}{3} \rfloor$

cubes.

00010000100001000001

Fig. 8. For i = 3 the word $p_i p_{i+1}$ contains 4 cubes of length 3i + 6 = 15.

It remains to show that there are no more cubes in $q_n$. Notice that we have
considered all cubes $u^3$ for which the number of 1's in u equals 0 or 1. On the
other hand, if this number exceeds 1 then u would contain the factor $1 0^i 1$ for
some $i \ge 1$, so $u^3$ would contain this factor at least three times. This is impossible,
since for a given i such a factor appears within $q_n$ at most twice. □
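
The count from Lemma 11 can be compared with a direct enumeration for small n. This is a sketch with hypothetical helper names (`q_word`, `count_cubes`, `lemma11_formula`); the closed form is written with integer arithmetic as n(n+1)/2 - 1 + ⌊(n+1)/3⌋, which equals the expression in the lemma.

```python
def q_word(n: int) -> str:
    """q_n = p_1 p_2 ... p_n, where p_i = 0^i 1 0^(i+1) 1."""
    return "".join("0" * i + "1" + "0" * (i + 1) + "1"
                   for i in range(1, n + 1))


def count_cubes(u: str) -> int:
    """Number of distinct nonempty cubes www occurring in u."""
    n = len(u)
    cubes = set()
    for i in range(n):
        for m in range(1, (n - i) // 3 + 1):
            if u[i:i + m] == u[i + m:i + 2 * m] == u[i + 2 * m:i + 3 * m]:
                cubes.add(u[i:i + 3 * m])
    return len(cubes)


def lemma11_formula(n: int) -> int:
    """n^2/2 + n/2 - 1 + floor((n+1)/3), computed in integers."""
    return n * (n + 1) // 2 - 1 + (n + 1) // 3
```

For n = 4 both sides give 10: nine cubes with exactly one 1 in the root (2 + 3 + 4 from the pairs $p_i p_{i+1}$) and the single cube 000.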

Theorem 4. For infinitely many positive integers m there exists a word of
length m for which the number of cubes is $\frac{m}{2} - o(m)$.

Proof. Due to Lemmas 10 and 11, for any word $q_n$ we have:

    $\frac{|q_n|}{2} - \mathrm{cubes}(q_n) = \frac{n^2}{2} + 2n - \frac{n^2}{2} - \frac{n}{2} + 1 - \lfloor \frac{n+1}{3} \rfloor = O(n) = o(|q_n|)$.

Thus, $\mathrm{cubes}(q_n) = \frac{|q_n|}{2} - o(|q_n|)$. □
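
To see how quickly the ratio approaches 1/2, one can evaluate the closed forms of Lemmas 10 and 11 numerically. The function names `q_length` and `q_cubes` below are our own, and the code only evaluates the formulas; it does not construct the words.

```python
def q_length(n: int) -> int:
    """|q_n| = n^2 + 4n (Lemma 10)."""
    return n * n + 4 * n


def q_cubes(n: int) -> int:
    """Number of distinct cubes in q_n (Lemma 11), in integer arithmetic."""
    return n * (n + 1) // 2 - 1 + (n + 1) // 3


if __name__ == "__main__":
    for n in (10, 100, 1000, 10000):
        print(n, q_length(n), q_cubes(n), q_cubes(n) / q_length(n))
```

For n = 10 the word has length 140 and 57 cubes, so the gap $|q_n|/2 - \mathrm{cubes}(q_n)$ is 13, which grows only linearly in n while $|q_n|$ grows quadratically.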

Interestingly, the example from the paper [11] of a family of words that
contain m - o(m) squares is quite similar to our example, but instead of $p_i$ it
utilizes words of the form $p'_i = 0^{i+1} 1 0^i 1 0^{i+1} 1$.

9 Conclusions 

In this paper we prove a tight bound for the number of nonprimitive squares 
in a word of length n. Unfortunately, this does not improve the overall bound 
of the number of squares — the main open problem is improving the bound for 
primitive squares. 

We also give some estimations of the number of cubes in a string of length 
n. These bounds are much better than the best known estimations for squares 
in general. We believe that at least the upper bound established in our paper is 
not tight. 